ABSTRACT - In this paper, carry-tree adders are proposed. Parallel-prefix adders are known to give the best performance
in VLSI designs, outperforming the Ripple Carry Adder (RCA) and the Carry Skip Adder (CSA). Delay measurements are
carried out for the Kogge-Stone adder, the sparse Kogge-Stone adder and the spanning tree adder. The Kogge-Stone and
sparse Kogge-Stone adders are faster than the RCA and the CSA. ModelSim-Altera 6.6d and Xilinx ISE 10.1 were used for
simulation and synthesis of the designs.

Index Terms - Carry skip adder (CSA), Kogge-Stone adder, ripple carry adder (RCA), sparse Kogge-Stone adder,
spanning tree adder.

I. INTRODUCTION 

In VLSI implementations, parallel-prefix adders are known to have the best performance. Reconfigurable logic
such as Field Programmable Gate Arrays (FPGAs) has been gaining in popularity in recent years because it offers
improved performance in terms of speed and power over DSP-based and microprocessor-based solutions for many
practical designs involving mobile DSP and telecommunications applications. On an FPGA, however, parallel-prefix
adders will perform differently than they do in VLSI implementations. In particular, most modern FPGAs employ a
fast-carry chain which optimizes the carry path for the simple Ripple Carry Adder (RCA).

An efficient testing strategy for evaluating the performance of these adders is discussed. Several tree-based
adder structures are implemented and characterized on an FPGA and compared with the Ripple Carry Adder (RCA)
and the Carry Skip Adder (CSA). Finally, some conclusions and suggestions for improving FPGA designs to
enable better tree-based adder performance are given.

II. CARRY-TREE ADDER DESIGNS 

Parallel-prefix adders, also known as carry-tree adders, pre-compute the propagate and generate signals [1]. These
signals are variously combined using the fundamental carry operator (fco) [2]:

$(g_L, p_L) \circ (g_R, p_R) = (g_L + p_L \cdot g_R,\ p_L \cdot p_R)$ (1)

Due to the associative property of the fco, these operators can be combined in different ways to form various adder
structures. For example, the four-bit carry-lookahead generator is given by:

$c_4 = (g_4, p_4) \circ [(g_3, p_3) \circ [(g_2, p_2) \circ (g_1, p_1)]]$ (2)

A simple rearrangement of the order of operations allows parallel operation, resulting in a more efficient tree structure for
this four-bit example:

$c_4 = [(g_4, p_4) \circ (g_3, p_3)] \circ [(g_2, p_2) \circ (g_1, p_1)]$ (3)

It is readily apparent that a key advantage of the tree-structured adder is that the critical path due to the carry delay is
on the order of $\log_2 N$ for an N-bit wide adder. The arrangement of the prefix network gives rise to various families of
adders. For a discussion of the various carry-tree structures, see [1,3].
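The associativity argument can be checked with a small behavioural model. The sketch below is illustrative only: the function names are hypothetical, the propagate signal is assumed to be the XOR of the operand bits, and the carry-in is assumed to be zero. It evaluates the serial grouping of equation (2) and the tree grouping of equation (3) and compares both against ordinary integer addition.

```python
def fco(left, right):
    """Fundamental carry operator of equation (1), applied to (g, p) pairs."""
    g_l, p_l = left
    g_r, p_r = right
    return (g_l | (p_l & g_r), p_l & p_r)


def bit_gp(a_bit, b_bit):
    """Per-bit generate and propagate signals (propagate assumed to be XOR)."""
    return (a_bit & b_bit, a_bit ^ b_bit)


def serial_c4(gp):
    """Equation (2): gp = [gp1, gp2, gp3, gp4], combined one after another."""
    gp1, gp2, gp3, gp4 = gp
    return fco(gp4, fco(gp3, fco(gp2, gp1)))


def tree_c4(gp):
    """Equation (3): the two inner pairs can be evaluated in parallel."""
    gp1, gp2, gp3, gp4 = gp
    return fco(fco(gp4, gp3), fco(gp2, gp1))


def check_all_4bit():
    """Compare both groupings with integer addition for every 4-bit operand pair."""
    for a in range(16):
        for b in range(16):
            gp = [bit_gp((a >> i) & 1, (b >> i) & 1) for i in range(4)]
            carry_out = (a + b) >> 4  # carry-in assumed to be 0
            if serial_c4(gp)[0] != carry_out or tree_c4(gp)[0] != carry_out:
                return False
    return True
```

Both groupings use three operator evaluations, but the tree form needs only two levels of logic instead of three, which is where the $\log_2 N$ depth comes from.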

For this study, the focus is on the Kogge-Stone adder [4].

Here we designate BC as the black cell, which generates the ordered pair in equation (1); the grey cell (GC) generates the left
signal only, following [1]. The interconnect area is known to be high, but for an FPGA with large routing overhead to
begin with, this is not as important as in a VLSI implementation. The regularity of the Kogge-Stone prefix
network has built-in redundancy, which has implications for fault-tolerant designs [5]. The sparse Kogge-Stone adder,
shown in Fig. 2, is also studied. This hybrid design completes the summation process with a 4-bit RCA,
allowing the carry prefix network to be simplified.

Fig. 2. Sparse 16-bit Kogge-Stone adder
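
As a rough illustration of the full Kogge-Stone prefix network, the following behavioural sketch models the adder in software. It is not the VHDL used in this study; the function name, the default width and the zero carry-in are assumptions made for the example. Each pass of the loop corresponds to one row of black and grey cells, with the combining distance doubling at every level.

```python
def kogge_stone_add(a, b, width=16):
    """Behavioural model of a Kogge-Stone adder (carry-in assumed to be 0).

    Returns (sum, carry_out) for two unsigned operands of the given width.
    """
    # Pre-computation: per-bit generate and propagate signals.
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]

    # Prefix network: about log2(width) levels, distance doubling each level.
    G, P = g[:], p[:]
    dist = 1
    while dist < width:
        new_G, new_P = G[:], P[:]
        for i in range(dist, width):
            new_G[i] = G[i] | (P[i] & G[i - dist])
            new_P[i] = P[i] & P[i - dist]
        G, P = new_G, new_P
        dist *= 2

    # G[i] is now the carry out of bit i, so the carry into bit i is G[i - 1].
    carries = [0] + G[:-1]
    total = 0
    for i in range(width):
        total |= (p[i] ^ carries[i]) << i
    return total, G[width - 1]
```

For a 16-bit adder the loop runs four times (distances 1, 2, 4 and 8), matching the logarithmic depth discussed above.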

Another carry-tree adder known as the spanning tree carry-lookahead (CLA) adder is also examined [6]. Like the
sparse Kogge-Stone adder, this design terminates with a 4-bit RCA. As the FPGA uses a fast carry-chain for the RCA, it is
interesting to compare the performance of this adder with the sparse Kogge-Stone and regular Kogge-Stone adders.
Also of interest for the spanning-tree CLA is its testability feature [7].

Fig. 3. Spanning tree carry-lookahead adder (16 bit)
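
Both the sparse Kogge-Stone adder and the spanning tree adder finish the summation with 4-bit RCAs. The sketch below illustrates only that general idea, under stated assumptions: a prefix network computes the carry into each 4-bit group, and a ripple stage produces the sum bits inside the group. The function names are hypothetical, the carry-in is assumed to be zero, and the group-level network is a generic Kogge-Stone prefix, not a reproduction of the exact cell arrangement in Fig. 2 or Fig. 3.

```python
def fco(left, right):
    """Fundamental carry operator of equation (1)."""
    g_l, p_l = left
    g_r, p_r = right
    return (g_l | (p_l & g_r), p_l & p_r)


def sparse_prefix_add(a, b, width=16, group=4):
    """Sparse prefix adder model: prefix carries per group, RCA inside each group.

    Assumes width is a multiple of group and the carry-in is 0.
    """
    bits = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1)
            for i in range(width)]

    # Collapse each group of bits into one (g, p) pair.
    groups = []
    for start in range(0, width, group):
        acc = bits[start]
        for i in range(start + 1, start + group):
            acc = fco(bits[i], acc)
        groups.append(acc)

    # Prefix network over the group pairs only (the "sparse" part).
    pre = groups[:]
    dist = 1
    while dist < len(pre):
        pre = [fco(pre[k], pre[k - dist]) if k >= dist else pre[k]
               for k in range(len(pre))]
        dist *= 2

    # Ripple-carry stage inside each group, seeded by the prefix carry.
    total = 0
    for k in range(len(groups)):
        carry = pre[k - 1][0] if k > 0 else 0
        for i in range(k * group, (k + 1) * group):
            a_i, b_i = (a >> i) & 1, (b >> i) & 1
            total |= (a_i ^ b_i ^ carry) << i
            carry = (a_i & b_i) | (carry & (a_i ^ b_i))
    return total, pre[-1][0]
```

With 16-bit operands the prefix network handles only four group pairs, which is why the carry tree is smaller than in the full Kogge-Stone design.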

III. METHOD OF STUDY

The adders to be studied were designed with varied bit widths up to 128 bits and coded in VHDL. The
functionality of the designs was verified via simulation with ModelSim. The Xilinx ISE 10.1 software was used to
synthesize the designs onto the Spartan 3E FPGA. In order to effectively test for the critical delay, two steps were taken.
First, a memory block (labelled as ROM in the figure) was instantiated on the FPGA using the Core Generator to
allow arbitrary patterns of inputs to be applied to the adder design. A multiplexer at each adder output selects whether or not
to include the adder in the measured results. A switch on the FPGA board was wired to the select pin of the
multiplexers. This allows measurements to be made that subtract out the delay due to the memory, the multiplexer, and
the interconnect (both external cabling and internal routing).

Second, the parallel prefix network was analysed to determine if a specific pattern could be used to extract the 
worst case delay. Considering the structure of the Generate-Propagate (GP) blocks (i.e., the BC and GC cells), we were able 
to develop the following scheme, by considering the following subset of input values to the GP blocks. 



Table 1: Subset of (g, p) Relations Used for Testing

(gL, pL)    (gR, pR)    (gL + pL·gR, pL·pR)
(0,1)       (0,1)       (0,1)
(0,1)       (1,0)       (1,0)
(1,0)       (0,1)       (1,0)
(1,0)       (1,0)       (1,0)



If we arbitrarily assign the (g, p) ordered pairs the values (1,0) = True and (0,1) = False, then the table is
self-contained and forms an OR truth table. In particular, if both inputs to the GP block are False, then the output is False;
conversely, if both inputs are True, then the output is True. Hence, an input pattern that alternates between generating
the (g, p) pairs (1,0) and (0,1) will force its GP block to alternate states. Likewise, the GP blocks fed by their
predecessors will also alternate states. Therefore, this scheme ensures that a worst-case delay
is generated in the parallel prefix network, since every block will be active. For this scheme to work,
the parallel prefix adders were synthesized with the "Keep Hierarchy" design setting turned on (otherwise, the FPGA
compiler attempts to reorganize the logic assigned to each LUT). With this option turned on, each GP
block is mapped to one LUT, preserving the basic parallel prefix structure and ensuring that this test strategy is
effective for determining the critical delay. The designs were also synthesized for speed rather than area optimization.
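
The OR-table property claimed for this subset of inputs is easy to confirm by enumeration. The following sketch is illustrative only (the names are hypothetical): it applies the fundamental carry operator to the four input combinations and checks that, under the mapping (1,0) = True and (0,1) = False, the output stays inside the subset and equals the logical OR of the inputs.

```python
def fco(left, right):
    """Fundamental carry operator of equation (1)."""
    g_l, p_l = left
    g_r, p_r = right
    return (g_l | (p_l & g_r), p_l & p_r)


TRUE_PAIR = (1, 0)   # (g, p) pair labelled True in the text
FALSE_PAIR = (0, 1)  # (g, p) pair labelled False in the text


def as_bool(pair):
    """Map a (g, p) pair from the test subset to its Boolean label."""
    return pair == TRUE_PAIR


def subset_forms_or_table():
    """Check that the four input combinations reproduce an OR truth table."""
    for left in (FALSE_PAIR, TRUE_PAIR):
        for right in (FALSE_PAIR, TRUE_PAIR):
            out = fco(left, right)
            if out not in (TRUE_PAIR, FALSE_PAIR):
                return False  # output left the subset
            if as_bool(out) != (as_bool(left) or as_bool(right)):
                return False
    return True
```

Because the subset is closed under the operator, a pattern built from these two pairs keeps every downstream GP block inside the same two states.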

IV. DISCUSSION OF RESULTS 

The simulated adder delays obtained from the Xilinx ISE synthesis reports are shown in the accompanying figure. An RCA as
large as 160 bits wide was synthesizable on the FPGA, while a Kogge-Stone adder up to 128 bits wide was implemented.
The carry-skip adders are compared with the Kogge-Stone adders. The actual measured data appears to be a bit smaller than
what is predicted by the Xilinx ISE synthesis reports. An analysis of these reports, which give a breakdown of delay due to
logic and routing, would seem to indicate that at adder widths approaching 256 bits and beyond, the Kogge-Stone
adder will have superior performance compared to the RCA. Based on the synthesis reports, the delay of the Kogge-Stone
adder can be predicted by the following equation:

$t_{KS} = (n + 2)\,\Delta_{LUT} + \rho_{KS}(n)$ (4)

where $N = 2^n$ is the adder bit width, $\Delta_{LUT}$ is the delay through a lookup table (LUT),
and $\rho_{KS}(n)$ is the routing delay of the Kogge-Stone adder as a function of n. The delay of the RCA can be predicted as:

$t_{RCA} = (N - 2)\,\Delta_{MUX} + \tau_{RCA}$ (5)

where $\Delta_{MUX}$ is the mux delay associated with the fast-carry chain and $\tau_{RCA}$ is a fixed logic delay. There is no
routing delay assumed for the RCA due to the use of the fast-carry chain. For the Spartan 3E FPGA, the synthesis reports
give the following values: $\Delta_{LUT}$ = 0.612 ns, $\Delta_{MUX}$ = 0.051 ns, and $\tau_{RCA}$ = 1.715 ns. Even though
$\Delta_{MUX} \ll \Delta_{LUT}$, it is expected that the Kogge-Stone adder will eventually be faster than
the RCA because $N = 2^n$, provided that $\rho_{KS}(n)$ grows relatively slower than $(N - 2)\,\Delta_{MUX}$. Indeed, Table 2
predicts that the Kogge-Stone adder will have superior performance at N = 256.
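
The two delay models can be evaluated directly. The sketch below is a minimal illustration: it uses the Spartan 3E constants quoted above and the fitted routing delays listed in Table 2, with hypothetical function and variable names, and evaluates equations (4) and (5) for each bit width.

```python
DELTA_LUT = 0.612  # ns, LUT delay from the synthesis reports
DELTA_MUX = 0.051  # ns, fast-carry-chain mux delay
TAU_RCA = 1.715    # ns, fixed logic delay of the RCA

# Fitted Kogge-Stone routing delay (ns) per bit width N, copied from Table 2.
FITTED_ROUTE = {4: 1.852, 16: 2.614, 32: 3.154, 64: 3.800, 128: 4.552, 256: 5.410}


def t_ks(width, route_delay):
    """Equation (4): Kogge-Stone delay, where width = 2**n."""
    n = width.bit_length() - 1  # exact log2 for powers of two
    return (n + 2) * DELTA_LUT + route_delay


def t_rca(width):
    """Equation (5): ripple carry adder delay on the fast-carry chain."""
    return (width - 2) * DELTA_MUX + TAU_RCA


def delay_table():
    """Return rows of (N, t_KS, t_RCA) in ns, rounded to three decimals."""
    return [(w, round(t_ks(w, r), 3), round(t_rca(w), 3))
            for w, r in sorted(FITTED_ROUTE.items())]
```

Evaluating `delay_table()` shows the crossover described in the text: the RCA model is still faster at N = 128, while the Kogge-Stone model is faster at N = 256.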



Table 2: Delay Results for the Kogge-Stone Adders

N      Synth. Predict   Route Delay   Route Fitted   Delay tKS   Delay tRCA
4      4.343            1.895         1.852          4.300       1.817
16     6.113            2.441         2.614          6.286       2.429
32     7.607            3.323         3.154          7.438       3.245
64     8.771            3.875         3.800          8.696       4.877
128    10.038           4.530         4.552          10.060      8.141
256                                   5.410          11.530      14.669

(all delays given in ns)

The second and third columns represent the total predicted delay and the delay due to routing only for the Kogge-Stone
adder, taken from the synthesis reports of the Xilinx ISE software. The fitted routing delay in column four is the
routing delay predicted by a quadratic polynomial in n fitted to the N = 4 to 128 data. This allows the N = 256 routing delay
to be predicted with some degree of confidence, as an actual Kogge-Stone adder at this bit width was not synthesized.
The final two columns give the predicted adder delays for the Kogge-Stone adder and the RCA using equations (4) and (5),
respectively.

The good match between the measured and simulated data for the implemented Kogge-Stone adders and RCAs
gives confidence that the predicted superiority of the Kogge-Stone adder at the 256-bit width is accurate. This differs from
the results in [10], where the parallel-prefix adders, including the Kogge-Stone adder, always exhibited inferior performance
compared with the RCA (simulation results out to 256 bits were reported). The work in [10] did use a different FPGA (Xilinx
Virtex 5), which may account for some of the differences.

The poor performance of some of the other implemented adders also deserves some comment. The spanning tree adder is
comparable in performance to the Kogge-Stone adder at 16 bits. However, the spanning tree adder is significantly slower at
higher bit widths according to the simulation results, and slightly slower according to the measured data. The structure of
the spanning tree adder results in an extra stage of logic for some adder outputs compared to the Kogge-Stone adder. This
fact, coupled with the way the FPGA place-and-route software arranges the adder, is likely the reason for this significant
increase in delay at higher bit widths. Similarly, the inferior performance of the carry-skip adders is due to the LUT delay
and routing overhead associated with each carry-skip logic structure. Even if the carry-skip logic could be implemented with
the fast-carry chain, this would just make it equivalent in speed to the RCA. Hence, the RCA delay represents the
theoretical lower limit for a carry-skip architecture on an FPGA.
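
The quadratic extrapolation of the routing delay can be sketched as an ordinary least-squares fit. The example below is an assumption-laden illustration, not the procedure used to produce Table 2: it fits only the five routing delays that appear in the table (N = 4, 16, 32, 64, 128), solves the 3x3 normal equations with Cramer's rule, and uses hypothetical function names. Its coefficients, and therefore its N = 256 prediction, may differ slightly from the tabulated fitted column because of rounding or a different data set.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_quadratic(xs, ys):
    """Least-squares fit y ~ a*x**2 + b*x + c; returns (a, b, c)."""
    s = [sum(x ** k for x in xs) for k in range(5)]                 # sums of x^0..x^4
    t = [sum((x ** k) * y for x, y in zip(xs, ys)) for k in range(3)]
    matrix = [[s[4], s[3], s[2]],
              [s[3], s[2], s[1]],
              [s[2], s[1], s[0]]]
    rhs = [t[2], t[1], t[0]]
    d = det3(matrix)
    coeffs = []
    for col in range(3):
        replaced = [row[:] for row in matrix]
        for r in range(3):
            replaced[r][col] = rhs[r]   # Cramer's rule: swap in the right-hand side
        coeffs.append(det3(replaced) / d)
    return tuple(coeffs)


# n = log2(N) and the synthesis-report routing delays (ns) from Table 2.
N_EXPONENTS = [2, 4, 5, 6, 7]
ROUTE_DELAY = [1.895, 2.441, 3.323, 3.875, 4.530]


def predict_route(n):
    """Routing delay predicted by the fitted quadratic at exponent n."""
    a, b, c = fit_quadratic(N_EXPONENTS, ROUTE_DELAY)
    return a * n * n + b * n + c
```

Calling `predict_route(8)` gives an estimate for the 256-bit routing delay that can be compared with the fitted value listed in Table 2.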



V. SIMULATION RESULTS

[Simulation waveforms: (a) Ripple-Carry Adder, (b) Carry-Select Adder, (c) Carry-Skip Adder, (d) Kogge-Stone Adder,
(e) Sparse Kogge-Stone Adder, (f) Spanning Tree Adder]

Figure (a)-(f): Simulation results for the 16-bit adders.

For the HDL structural designs, test vectors were provided for excitation, and the responses are as shown in the figure.
The input reference vectors are:

Ripple carry adder:        a = 0010110111010101, b = 0010110011011110
Carry select adder:        a = 0010111100111100, b = 0011001111001111
Carry skip adder:          a = 0101101110111010, b = 0011011001101111
Kogge-Stone adder:         a = 0000110101100100, b = 0010100001100100
Sparse Kogge-Stone adder:  a = 0101110101011000, b = 0011010010110111
Spanning tree adder:       a = 0001101101010110, b = 0001100101111011



VI. SYNTHESIS REPORT

Final Results
RTL Top Level Output File Name : ripple carry adder.ngr
Top Level Output File Name     : ripple carry adder
Output Format                  : NGC
Optimization Goal              : Speed
Keep Hierarchy                 : No

Design Statistics
# IOs                          : 50

Cell Usage:
# BELS                         : 32
#      LUT3                    : 32
# IO Buffers                   : 50
#      IBUF                    : 33
#      OBUF                    : 17

Timing constraints
Delay:        21.69ns (Levels of Logic = 18)
Source:       B<0> (PAD)
Destination:  Cout (PAD)
Data Path:    B<0> to Cout



Cell:in->out    Fanout   Gate delay   Net delay   Logical Name (Net Name)
IBUF:I->O       2        1.106                    B_0_IBUF (B_0_IBUF)
LUT3:I0->O      2        0.612        0.449       FA0/cout1 (c<0>)
LUT3:I1->O      2        0.612        0.449       FA1/cout1 (c<1>)
LUT3:I1->O      2        0.612        0.449       FA2/cout1 (c<2>)
LUT3:I1->O      2        0.612        0.449       FA3/cout1 (c<3>)
LUT3:I1->O      2        0.612        0.449       FA4/cout1 (c<4>)
LUT3:I1->O      2        0.612        0.449       FA5/cout1 (c<5>)
LUT3:I1->O      2        0.612        0.449       FA6/cout1 (c<6>)
LUT3:I1->O      2        0.612        0.449       FA7/cout1 (c<7>)
LUT3:I1->O      2        0.612        0.449       FA8/cout1 (c<8>)
LUT3:I1->O      2        0.612        0.449       FA9/cout1 (c<9>)
LUT3:I1->O      2        0.612        0.449       FA10/cout1 (c<10>)
LUT3:I1->O      2        0.612        0.449       FA11/cout1 (c<11>)
LUT3:I1->O      2        0.612        0.449       FA12/cout1 (c<12>)
LUT3:I1->O      2        0.612        0.449       FA13/cout1 (c<13>)
LUT3:I1->O      2        0.612        0.449       FA14/cout1 (c<14>)
LUT3:I1->O      1        0.612        0.357       FA15/cout1 (c<15>)
OBUF:I->O                3.169                    Cout_OBUF (Cout)

Top Level Output File Name     : kogge-stone adder
Optimization Goal              : Speed
Keep Hierarchy                 : No

Design Statistics
# IOs                          : 50

Cell Usage:
# BELS                         : 41
#      GND                     : 1
#      LUT3                    : 27
#      LUT4                    : 9
# IO Buffers                   : 50
#      IBUF                    : 33
#      OBUF                    : 17

Timing constraints
Delay:        20.262ns (Levels of Logic = 17)
Source:       b<1> (PAD)
Destination:  Sum<14> (PAD)
Data Path:    b<1> to Sum<14>





Cell:in->out    Fanout   Gate delay   Net delay   Logical Name (Net Name)
IBUF:I->O       4        1.106        0.651       b_1_IBUF (b_1_IBUF)
LUT4:I0->O      1        0.612        0.000       GC2/G1_SW01 (GC2/G1_SW0)
MUXF5:I1->O     2        0.278        0.410       GC2/G1_SW0_f5 (q<1>)
LUT3:I2->O      2        0.612        0.532       GC2/G1 (q<2>)
LUT3:I0->O      2        0.612        0.532       GC6/G_SW0_SW0 (s<3>)
LUT3:I0->O      2        0.612        0.532       GC7/G_SW0_SW0 (s<4>)
LUT3:I0->O      2        0.612        0.532       GC8/G_SW0_SW0 (s<5>)
LUT3:I0->O      2        0.612        0.410       GC9/G_SW0_SW0 (s<6>)
LUT3:I2->O      3        0.612        0.603       GC9/G_SW0 (v<7>)
LUT3:I0->O      2        0.612        0.410       GC9/G_SW1 (v<8>)
LUT3:I2->O      2        0.612        0.410       GC9/G (v<9>)
LUT3:I2->O      2        0.612        0.532       GC12/G_SW0 (v<10>)
LUT3:I0->O      2        0.612        0.410       GC12/G_SW1 (GC13/G5)
LUT3:I2->O      2        0.612        0.410       GC12/G (GC14/G9)
LUT3:I2->O      2        0.612        0.410       GC14/G18 (GC13/G34)
LUT3:I2->O      1        0.612        0.357       Mxor_sum<14>_Result1 (sum_14_OBUF)
OBUF:I->O                3.169                    sum_14_OBUF (sum<14>)
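
The reported path delay is the sum of the gate and net delays of every cell on the path. As a quick consistency check, the sketch below (hypothetical names; the values are transcribed from the Kogge-Stone data path above, with the missing OBUF net delay taken as zero) adds up the 17 levels of logic.

```python
# (cell, gate delay in ns, net delay in ns) for the path b<1> to Sum<14>.
KS_PATH = [
    ("IBUF", 1.106, 0.651),
    ("LUT4", 0.612, 0.000),
    ("MUXF5", 0.278, 0.410),
    ("LUT3", 0.612, 0.532),
    ("LUT3", 0.612, 0.532),
    ("LUT3", 0.612, 0.532),
    ("LUT3", 0.612, 0.532),
    ("LUT3", 0.612, 0.410),
    ("LUT3", 0.612, 0.603),
    ("LUT3", 0.612, 0.410),
    ("LUT3", 0.612, 0.410),
    ("LUT3", 0.612, 0.532),
    ("LUT3", 0.612, 0.410),
    ("LUT3", 0.612, 0.410),
    ("LUT3", 0.612, 0.410),
    ("LUT3", 0.612, 0.357),
    ("OBUF", 3.169, 0.000),  # no net delay listed for the output buffer
]


def path_delay(path):
    """Return (logic delay, routing delay, total delay) in ns for a timing path."""
    logic = sum(gate for _, gate, _ in path)
    routing = sum(net for _, _, net in path)
    return logic, routing, logic + routing
```

Splitting the total into logic and routing parts in this way is the same breakdown that the delay model of equation (4) relies on.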

Final Results
RTL Top Level Output File Name : sparse kogge-stone Adder.ngr
Top Level Output File Name     : sparse kogge
Output Format                  : NGC
Optimization Goal              : Speed
Keep Hierarchy                 : No

Design Statistics
# IOs                          : 65

Cell Usage:
# BELS                         : 54
#      LUT2                    : 2
#      LUT3                    : 30
#      LUT4                    : 19
#      MUXF5                   : 3
# IO Buffers                   : 65
#      IBUF                    : 33
#      OBUF                    : 32

Timing constraints
Delay:        15.916ns (Levels of Logic = 13)
Source:       a<6> (PAD)
Destination:  C<16> (PAD)
Data Path:    a<6> to C<16>



Cell:in->out    Fanout   Gate delay   Net delay   Logical Name (Net Name)
IBUF:I->O       4        1.106        0.651       a_6_IBUF (a_6_IBUF)
LUT4:I0->O      2        0.612        0.449       BC8/G18 (BC8/G18)
LUT4:I1->O      1        0.612        0.000       BC8/G461 (BC8/G461)
MUXF5:I1->O     3        0.278        0.603       BC8/G46_f5 (BC8/G46)
LUT4:I0->O      1        0.612        0.387       GC3/C13 (GC3/C13)
LUT3:I2->O      1        0.612        0.360       GC3/C21 (GC3/C21)
LUT4:I3->O      1        0.612        0.426       GC3/C46 (GC3/C46)
LUT4:I1->O      2        0.612        0.449       GC3/C77 (GC3/C77)
LUT3:I1->O      3        0.612        0.520       FA13/cout1 (C_13_OBUF)
LUT3:I1->O      3        0.612        0.520       FA14/cout1 (C_14_OBUF)
LUT3:I1->O      3        0.612        0.520       FA15/cout1 (C_15_OBUF)
LUT3:I1->O      1        0.612        0.357       FA16/cout1 (C_16_OBUF)
OBUF:I->O                3.169                    C_16_OBUF (C<16>)



VII. IMPLEMENTATION AND RESULTS 

The proposed designs were functionally verified and the timing reports were obtained. Simulation was performed in
ModelSim and synthesis in Xilinx ISE. The delays reported for the 16-bit designs are:

N     Delay tRCA    Delay tKS     Delay tSKS
16    21.690 ns     20.262 ns     15.916 ns


VIII. CONCLUSION 

In this paper, optimization techniques for parallel-prefix adders have been proposed and implemented. A ripple carry
adder, a Kogge-Stone adder, a sparse Kogge-Stone adder and a spanning tree adder were designed, and the parallel-prefix
adders were found to be faster than the ripple carry adder. Functional verification of the designs was performed
through simulation using the Verilog HDL flow in ModelSim, and synthesis was done using Xilinx. Among the 16-bit designs
reported here, the sparse Kogge-Stone adder gives the shortest delay, followed by the Kogge-Stone adder and then the
ripple carry adder.